Once I shared my first article about “Getting Started with Polars” I started thinking about something exciting: comparing the speed of Polars and Pandas. This curiosity was sparked by all the buzz around the brand-new Pandas 2.0 release, promising lightning-fast performance. Pandas 2.0 was announced to come packed with cool features, including the addition of Apache Arrow (pyarrow) as its backing memory format. The big perk of Apache Arrow is that it makes operations speedier and more memory-friendly. Naturally, this got me wondering: how does Pandas 2.0 measure up against Polars? Let’s dive in and find out!
Warning
Keep in mind: occasionally, I refer to some performance difference figures. Even though I repeated these experiments 100 times each, because computations can have a bit of randomness, there might be slight variations in the exact numbers.
Setup
Code
import pandas as pdimport polars as plimport polars.selectors as csimport numpy as npimport timeitimport randomimport plotly.io as pioimport plotly.express as pxpio.templates.default ="presentation"print(pd.__version__)print(pl.__version__)
2.0.3
0.18.11
Note
Throughout this article, we’ll be working with the latest versions of the packages: Pandas 2.0.3 and Polars 0.18.11.
Generate the synthetic data
For the purpose of benchmarking, we’re going to create our own dataset. To ensure simplicity and uniformity, we’ll generate a dataframe containing columns like name, year of birth, and city of residence – pretty straightforward stuff. We’re aiming for a dataset of 10,000 rows. While we could certainly run this benchmark on a dataframe with more than a million rows, our intention is to provide a realistic example. The code we’ll provide allows you to effortlessly explore various scenarios according to your requirements. Feel free to use it or modify it if you want to perform your own test.
Code
def create_sample_dataframe(n_rows: int, name: str, seed: int=42) -> pl.DataFrame:""" Create a sample Polars DataFrame with the specified number of rows. This function generates random data for several columns and creates a Polars DataFrame with the specified number of rows. The generated data includes information about names, birth years, cities, zip codes, incomes, marital status, number of children, and car brands. The resulting DataFrame is written to CSV and Parquet files named 'sample.csv' and '{name}.parquet', respectively. Parameters ---------- n_rows : int The number of rows to generate in the sample DataFrame. name : str The name to be used when writing the Parquet file. seed : int, optional The seed to use when generating random data. Default is 42. Returns ------- pl.DataFrame The first 5 rows of the generated sample DataFrame. """ random.seed(seed) np.random.seed(seed) data = {"name": np.random.choice(["Alice", "Bob", "Charlie", "David", "Eva"], n_rows),"born": np.random.randint(1950, 1990, size=n_rows),"city": np.random.choice( ["Shanghai", "San Francisco", "London", "Munich", "Mumbai"], n_rows ),"zip_code": np.random.randint(10000, 99999, size=n_rows),"income": np.random.normal(50000, 10000, size=n_rows),"is_married": np.random.choice([True, False], n_rows),"children": np.random.randint(0, 5, size=n_rows),"car": np.random.choice(["Ford", "BMW", "Toyota", "Bentley", "Mini"], n_rows), } temp_df = pl.DataFrame(data) temp_df.write_csv("sample.csv") temp_df.write_parquet(f"{name}.parquet")return temp_df.head()create_sample_dataframe(n_rows=10_000, name="sample", seed=1)
shape: (5, 8)
name
born
city
zip_code
income
is_married
children
car
str
i32
str
i32
f64
bool
i32
str
"David"
1953
"Shanghai"
35716
69693.044244
false
4
"Bentley"
"Eva"
1984
"London"
30931
30547.027839
true
1
"Toyota"
"Alice"
1979
"San Francisco"
85967
63489.882712
false
1
"Mini"
"Bob"
1969
"San Francisco"
64527
56030.692112
false
1
"Toyota"
"David"
1979
"Munich"
53316
51118.005552
true
3
"Ford"
Define helper class
In order to streamline our code, we’ll create a convenient helper class featuring two essential methods: get_time and get_figure. The get_time method leverages the timeit.repeat function to accurately gauge the execution time of our code. In this context, we’ll set the parameters number = 1 and repeat = 100. This configuration runs the operations 100 times, allowing us to derive key statistics like the mean, median, and standard deviation.
On the other hand, the get_figure method takes these gathered results and generates a visually appealing image using Plotly.
To bring it all together and keep things simple, we’ve designed the run_test function. This function orchestrates the entire process, seamlessly merging the steps outlined above.
Code
class Timeit:""" A class for measuring the execution time of multiple functions. This class provides methods for measuring the execution time of multiple functions using the `timeit` module. The functions to be tested are passed to the constructor as a dictionary, where the keys are the names of the functions and the values are the functions themselves. The `number` and `repeat` parameters control how many times each function is executed and how many times the timing is repeated, respectively. The `get_time` method measures the execution time of each function and stores the results in a dictionary, where the keys are the names of the functions and the values are lists of execution times. The `get_figure` method generates a box plot of the execution times using the `plotly.express.box` function. Parameters ---------- func_dict : dict A dictionary of functions to test, where the keys are the names of the functions and the values are the functions themselves. number : int, optional The number of times to execute each function in each timing run (default is 1). repeat : int, optional The number of times to repeat each timing run (default is 20). Attributes ---------- results : dict or None A dictionary of execution times for each function, where the keys are the names of the functions and the values are lists of execution times. This attribute is `None` until `get_time` is called. Methods ------- get_time() Measure the execution time of each function and store the results in the `results` attribute. get_figure() Generate a box plot of the execution times using `plotly.express.box`. """def__init__(self, func_dict, number=1, repeat=100):self.func_dict = func_dictself.number = numberself.repeat = repeatself.results =Nonedef get_time(self):""" Measure the execution time of each function and store the results. This method uses the `timeit.repeat` function to measure the execution time of each function in `func_dict`. The results are stored in a dictionary, where the keys are the names of the functions and the values are lists of execution times. This dictionary is also stored in the `results` attribute of the `Timeit` object. Returns ------- dict A dictionary of execution times for each function. """ results = {}for k, v inself.func_dict.items(): results[k] = timeit.repeat(v, number=self.number, repeat=self.repeat)self.results = resultsreturn resultsdef get_figure(self):""" Generate a box plot of the execution times. This method uses `plotly.express.box` to generate a box plot of the execution times stored in `results`.If `results` is `None`, this method raises a `ValueError`. Returns ------- plotly.graph_objs.Figure A box plot figure object. Raises ------ ValueError If no results are available (i.e., if `get_time` has not been called). """ifself.results isNone:raiseValueError("No results available. Please run get_time first.") df = pd.DataFrame(self.results) mean_times = df.median() *1000 title =f'Median Execution Times: <br><sup>{", ".join([f"{k}={v:.6f}ms"for k,v in mean_times.items()])}</sup>'return px.box( df, points="all", labels={"value": "time (sec)", "variable": ""}, title=title, width=600, height=500, )def run_test(test: dict, name: str):""" Run a timing test on a dictionary of functions using the Timeit class and return the results as a DataFrame. This function takes a dictionary of functions as input, where the keys are the names of the functions and the values are the functions themselves. It creates a Timeit object with this dictionary and uses it to measure the execution time of each function. The results are then displayed as a box plot using the `get_figure` method of the Timeit object and returned as a Pandas DataFrame with columns renamed according to the `name` argument. Parameters ---------- test : dict A dictionary of functions to test, where the keys are the names of the functions and the values are the functions themselves. name : str A string used to rename the columns of the returned DataFrame. Returns ------- pd.DataFrame A DataFrame containing the results of the timing test, with columns renamed according to the `name` argument. """# Create a Timeit object t = Timeit(test)# Measure the execution time of the functions results = t.get_time()# Generate a box plot of the execution times fig = t.get_figure() fig.show()return pd.DataFrame(results).rename( columns={"Pandas": f"{name}_pandas", "Polars": f"{name}_polars"} )
Reading in data From a CSV File and parquet
CSV
Reading CSV files from disk is a task that data scientists often find themselves doing. Now, let’s see how these two libraries compare for this particular job. To maximize the blazing-fast data handling capabilities of PyArrow, we’ll equip Pandas with the engine="pyarrow" and dtype_backend="pyarrow" arguments. Let’s see how these choices shape the performance!
For the sake of comparison, we’ll also demonstrate the timeit function invoked using Jupyter cell magic. You’ll notice that the numbers generated this way are quite closely aligned with ours.
Polars unquestionably wins this round, it can boast a speed advantage of 2 to 4 times over Pandas.
Selecting Columns
Alright, let’s spice things up a bit and select some columns to see which library races ahead in terms of speed!
Files awaiting in-memory reading
A clever approach to conserve memory and enhance speed involves reading only the columns essential for operations. Consider a scenario where we’re interested in displaying just the names from this dataset. The big question now: how do these libraries measure up in terms of speed? Let’s find out!
As anticipated, Polars continues to showcase its swiftness. It’s worth highlighting the usage of the lazy and collect methods in Polars. These nifty tools grant us access to the library’s clever query optimization techniques, which play a pivotal role in significantly enhancing performance. OK, one step further: suppose our files are already loaded into memory. Would there still be a distinction in performance under this circumstance?
While Polars showed a significant speed advantage for tasks involving pre-read files, both libraries perform similarly when the files are already in memory.
Filtering Rows
Now, let’s explore the scenario where we filter our dataset based on one or more column values
Based on one condition
For our simple scenario, we’ll be narrowing down our focus to filter data based on individuals with the name “David”.
Now, for a more intricate challenge, we’re going to dive into querying the data to extract individuals who meet specific criteria: those named David, born after 1980, residing in a city other than London, married, and with three children.
Code
test_dict = {"Pandas": lambda: ( df_pandas.query("name=='David' and born>1980 and city != 'London' or is_married == True and children >= 3" ) ),"Polars": lambda: ( df_polars.lazy() .filter( (pl.col("name") =="David")& (pl.col("born") >1980)& (pl.col("city") !="London")| (pl.col("is_married") ==True) & (pl.col("children") >=3) ) .collect() ),}filter_row_multiple_condition = run_test(test_dict, "Filter (complex)")
Note
Both libraries tackled this challenge quite well, yet Pandas struggled to keep pace with Polars. It’s intriguing to observe that while Pandas required nearly twice the time for the more intricate task, Polars managed to complete it in almost the same amount of time. Parallelization in action.
Performing operations on columns
Now, let’s roll up our sleeves and dive into performing some operations on the columns.
Single operation
As a single operation, we’ll simply calculate the century in which these individuals were born.
Let’s also explore what happens when performing multiple operations on the columns. We’ll mix things up with some string operations, mapping, and math calculations to see how these libraries handle it!
Once again, Polars takes the lead. While both libraries required more time for the task involving multiple operations, Polars demonstrated superior scalability in this scenario.
Concatenating Data
Now, let’s turn our attention to concatenating two datasets. Let the merging begin!
Once more, Polars shines with a remarkable speed advantage.
Aggregation
Simple
Time to shift our attention to aggregation. First up, a simple task: let’s calculate the mean income based on names. Then, for a bit more complexity, we’ll dive into computing statistics involving the income, children, and car columns. Things are about to get interesting!
While Pandas showcased noteworthy speed for the simple aggregation, the more intricate task exposed significant disparities between the two libraries. Polars took a commanding lead in this scenario, presenting a considerably faster performance compared to Pandas.
Whole workflow
We’re going to tackle this in two versions. First, we’ll use pl.read_csv for Polars, assessing the time it takes to complete the entire workflow on a DataFrame already in memory. Then, for the second version, we’ll employ pl.scan_csv for Polars. This nifty function lets us lazily read data from a CSV file or multiple files using glob patterns.
The cool part? With pl.scan_csv, the query optimizer can push down predicates and projections to the scan level. This nifty move has the potential to cut down on memory overhead. Let’s dive into both versions and see how they stack up!
As evident, the utilization of scan_csv increased the required time by about 3-4 times. However, even with this increase, Polars still manages to maintain a substantial advantage of around 5 times faster than the entire workflow executed using Pandas.
Note
When we consider the entirety of the data processing pipeline, irrespective of the file reading approach, Polars emerges as the victor. It consistently exhibits a considerable speed advantage compared to Pandas.
Putting it all together
Time to bring together all the things we’ve explored! Let’s sum up what we’ve learned.
Throughout our exploration, Polars has consistently outperformed Pandas. We intentionally focused on tasks that frequently arise in data analysis, encompassing file reading, column selection, filtering, and more. However, where Polars truly shines is in aggregation, capitalizing on its multiprocessing prowess. Take a glance at Figure 14, and you’ll notice that reading files is the most time-consuming step for Polars. To address this (and reduce memory usage at the same time), we have the nifty scan_csv function that enables Polars to operate in a lazy mode, optimizing the entire data pipeline. It’s clear that Polars packs a punch in the realm of data processing!
Should you consider using Polars? Based on the findings of this experiment, it’s evident that Polars can bring substantial benefits to every stage of your data processing. Its remarkable speed can make a significant impact. If you’re open to investing time in understanding its API (which isn’t drastically different from Pandas), I’m confident your projects will gain a significant speed boost. And don’t forget, the trajectory of the project seems promising, implying a bright future ahead.
Whatever path you choose, ensure you become well-acquainted with your selected library and stay current with emerging technologies. And above all, relish the journey. Happy coding, and until next time! 🐼🐍🤓💻
Source Code
---title: 'The Ultimate Speed Test: Pandas vs Polars 'author: Adam Cseresznyedate: '2023-08-13'categories: - Polars - Pandasjupyter: python3toc: trueformat: html: code-fold: true code-tools: true---#| fig-cap: Photo by Hans-Jurgen Mager on Unsplash#| label: fig-fig0![](hans-jurgen-mager-qQWV91TTBrE-unsplash.jpg){fig-align="center" width=50%}Once I shared my first article about ["Getting Started with Polars"](https://adamcseresznye.github.io/blog/posts/polars/Polars_revised.html) I started thinking about something exciting: comparing the speed of Polars and Pandas. This curiosity was sparked by all the buzz around the brand-new [Pandas 2.0 release](https://pandas.pydata.org/docs/dev/whatsnew/v2.0.0.html), promising lightning-fast performance. Pandas 2.0 was announced to come packed with cool features, including the addition of Apache Arrow (pyarrow) as its backing memory format. The big perk of Apache Arrow is that it makes operations speedier and more memory-friendly. Naturally, this got me wondering: how does Pandas 2.0 measure up against Polars? Let's dive in and find out!::: {.callout-warning}Keep in mind: occasionally, I refer to some performance difference figures. Even though I repeated these experiments 100 times each, because computations can have a bit of randomness, there might be slight variations in the exact numbers.:::# Setup```{python}import pandas as pdimport polars as plimport polars.selectors as csimport numpy as npimport timeitimport randomimport plotly.io as pioimport plotly.express as pxpio.templates.default ="presentation"print(pd.__version__)print(pl.__version__)```::: {.callout-note}Throughout this article, we'll be working with the latest versions of the packages: Pandas 2.0.3 and Polars 0.18.11.:::# Generate the synthetic dataFor the purpose of benchmarking, we're going to create our own dataset. To ensure simplicity and uniformity, we'll generate a dataframe containing columns like name, year of birth, and city of residence – pretty straightforward stuff. We're aiming for a dataset of 10,000 rows. While we could certainly run this benchmark on a dataframe with more than a million rows, our intention is to provide a realistic example. The code we'll provide allows you to effortlessly explore various scenarios according to your requirements. Feel free to use it or modify it if you want to perform your own test.```{python}def create_sample_dataframe(n_rows: int, name: str, seed: int=42) -> pl.DataFrame:""" Create a sample Polars DataFrame with the specified number of rows. This function generates random data for several columns and creates a Polars DataFrame with the specified number of rows. The generated data includes information about names, birth years, cities, zip codes, incomes, marital status, number of children, and car brands. The resulting DataFrame is written to CSV and Parquet files named 'sample.csv' and '{name}.parquet', respectively. Parameters ---------- n_rows : int The number of rows to generate in the sample DataFrame. name : str The name to be used when writing the Parquet file. seed : int, optional The seed to use when generating random data. Default is 42. Returns ------- pl.DataFrame The first 5 rows of the generated sample DataFrame. """ random.seed(seed) np.random.seed(seed) data = {"name": np.random.choice(["Alice", "Bob", "Charlie", "David", "Eva"], n_rows),"born": np.random.randint(1950, 1990, size=n_rows),"city": np.random.choice( ["Shanghai", "San Francisco", "London", "Munich", "Mumbai"], n_rows ),"zip_code": np.random.randint(10000, 99999, size=n_rows),"income": np.random.normal(50000, 10000, size=n_rows),"is_married": np.random.choice([True, False], n_rows),"children": np.random.randint(0, 5, size=n_rows),"car": np.random.choice(["Ford", "BMW", "Toyota", "Bentley", "Mini"], n_rows), } temp_df = pl.DataFrame(data) temp_df.write_csv("sample.csv") temp_df.write_parquet(f"{name}.parquet")return temp_df.head()create_sample_dataframe(n_rows=10_000, name="sample", seed=1)```# Define helper classIn order to streamline our code, we'll create a convenient helper class featuring two essential methods: `get_time` and `get_figure`. The `get_time` method leverages the `timeit.repeat` function to accurately gauge the execution time of our code. In this context, we'll set the parameters `number = 1` and `repeat = 100`. This configuration runs the operations 100 times, allowing us to derive key statistics like the mean, median, and standard deviation. On the other hand, the `get_figure` method takes these gathered results and generates a visually appealing image using Plotly. To bring it all together and keep things simple, we've designed the `run_test` function. This function orchestrates the entire process, seamlessly merging the steps outlined above.```{python}class Timeit:""" A class for measuring the execution time of multiple functions. This class provides methods for measuring the execution time of multiple functions using the `timeit` module. The functions to be tested are passed to the constructor as a dictionary, where the keys are the names of the functions and the values are the functions themselves. The `number` and `repeat` parameters control how many times each function is executed and how many times the timing is repeated, respectively. The `get_time` method measures the execution time of each function and stores the results in a dictionary, where the keys are the names of the functions and the values are lists of execution times. The `get_figure` method generates a box plot of the execution times using the `plotly.express.box` function. Parameters ---------- func_dict : dict A dictionary of functions to test, where the keys are the names of the functions and the values are the functions themselves. number : int, optional The number of times to execute each function in each timing run (default is 1). repeat : int, optional The number of times to repeat each timing run (default is 20). Attributes ---------- results : dict or None A dictionary of execution times for each function, where the keys are the names of the functions and the values are lists of execution times. This attribute is `None` until `get_time` is called. Methods ------- get_time() Measure the execution time of each function and store the results in the `results` attribute. get_figure() Generate a box plot of the execution times using `plotly.express.box`. """def__init__(self, func_dict, number=1, repeat=100):self.func_dict = func_dictself.number = numberself.repeat = repeatself.results =Nonedef get_time(self):""" Measure the execution time of each function and store the results. This method uses the `timeit.repeat` function to measure the execution time of each function in `func_dict`. The results are stored in a dictionary, where the keys are the names of the functions and the values are lists of execution times. This dictionary is also stored in the `results` attribute of the `Timeit` object. Returns ------- dict A dictionary of execution times for each function. """ results = {}for k, v inself.func_dict.items(): results[k] = timeit.repeat(v, number=self.number, repeat=self.repeat)self.results = resultsreturn resultsdef get_figure(self):""" Generate a box plot of the execution times. This method uses `plotly.express.box` to generate a box plot of the execution times stored in `results`.If `results` is `None`, this method raises a `ValueError`. Returns ------- plotly.graph_objs.Figure A box plot figure object. Raises ------ ValueError If no results are available (i.e., if `get_time` has not been called). """ifself.results isNone:raiseValueError("No results available. Please run get_time first.") df = pd.DataFrame(self.results) mean_times = df.median() *1000 title =f'Median Execution Times: <br><sup>{", ".join([f"{k}={v:.6f}ms"for k,v in mean_times.items()])}</sup>'return px.box( df, points="all", labels={"value": "time (sec)", "variable": ""}, title=title, width=600, height=500, )def run_test(test: dict, name: str):""" Run a timing test on a dictionary of functions using the Timeit class and return the results as a DataFrame. This function takes a dictionary of functions as input, where the keys are the names of the functions and the values are the functions themselves. It creates a Timeit object with this dictionary and uses it to measure the execution time of each function. The results are then displayed as a box plot using the `get_figure` method of the Timeit object and returned as a Pandas DataFrame with columns renamed according to the `name` argument. Parameters ---------- test : dict A dictionary of functions to test, where the keys are the names of the functions and the values are the functions themselves. name : str A string used to rename the columns of the returned DataFrame. Returns ------- pd.DataFrame A DataFrame containing the results of the timing test, with columns renamed according to the `name` argument. """# Create a Timeit object t = Timeit(test)# Measure the execution time of the functions results = t.get_time()# Generate a box plot of the execution times fig = t.get_figure() fig.show()return pd.DataFrame(results).rename( columns={"Pandas": f"{name}_pandas", "Polars": f"{name}_polars"} )```# Reading in data From a CSV File and parquet## CSVReading CSV files from disk is a task that data scientists often find themselves doing. Now, let's see how these two libraries compare for this particular job. To maximize the blazing-fast data handling capabilities of PyArrow, we'll equip Pandas with the `engine="pyarrow"` and `dtype_backend="pyarrow"` arguments. Let's see how these choices shape the performance!```{python}#| fig-cap: Reading in data From a CSV File#| label: fig-fig1test_dict = {"Pandas": lambda: pd.read_csv("sample.csv", engine="pyarrow", dtype_backend="pyarrow" ),"Polars": lambda: pl.read_csv("sample.csv"),}read_csv = run_test(test_dict, "Read csv")```For the sake of comparison, we'll also demonstrate the timeit function invoked using Jupyter cell magic. You'll notice that the numbers generated this way are quite closely aligned with ours.```{python}#| tags: []%%timeitpd.read_csv("sample.csv", engine="pyarrow", dtype_backend="pyarrow")``````{python}#| tags: []%%timeitpl.read_csv("sample.csv")```## parquetNow, let's read the data in Parquet format.```{python}#| fig-cap: Reading in data From a parquet File#| label: fig-fig2#| tags: []test_dict = {"Pandas": lambda: pd.read_parquet("sample.parquet", engine="pyarrow", dtype_backend="pyarrow" ),"Polars": lambda: pl.read_parquet("sample.parquet"),}read_parquet = run_test(test_dict, "Read parquet")```::: {.callout-note}Polars unquestionably wins this round, it can boast a speed advantage of 2 to 4 times over Pandas.:::# Selecting ColumnsAlright, let's spice things up a bit and select some columns to see which library races ahead in terms of speed!## Files awaiting in-memory readingA clever approach to conserve memory and enhance speed involves reading only the columns essential for operations. Consider a scenario where we're interested in displaying just the names from this dataset. The big question now: how do these libraries measure up in terms of speed? Let's find out!```{python}#| fig-cap: Selecting Columns from a File Not Yet in Memory#| label: fig-fig3test_dict = {"Pandas": lambda: pd.read_csv("sample.csv", engine="pyarrow", dtype_backend="pyarrow", usecols=["name"] ),"Polars": lambda: pl.scan_csv("sample.csv").select(pl.col("name")).collect(),}select_col_not_in_memory = run_test(test_dict, "Select column (not in memory)")```## File is in memoryAs anticipated, Polars continues to showcase its swiftness. It's worth highlighting the usage of the `lazy` and `collect` methods in Polars. These nifty tools grant us access to the library's clever query optimization techniques, which play a pivotal role in significantly enhancing performance. OK, one step further: suppose our files are already loaded into memory. Would there still be a distinction in performance under this circumstance?```{python}#| tags: []df_pandas = pd.read_csv("sample.csv", engine="pyarrow", dtype_backend="pyarrow")df_polars = pl.read_csv("sample.csv")``````{python}#| fig-cap: Selecting Columns from a File Already in Memory#| label: fig-fig4#| tags: []test_dict = {"Pandas": lambda: df_pandas.loc[:, "name"],"Polars": lambda: df_polars.lazy().select(pl.col("name")).collect(),}select_col_in_memory = run_test(test_dict, "Select column")```::: {.callout-note}While Polars showed a significant speed advantage for tasks involving pre-read files, both libraries perform similarly when the files are already in memory.:::# Filtering RowsNow, let's explore the scenario where we filter our dataset based on one or more column values ## Based on one conditionFor our simple scenario, we'll be narrowing down our focus to filter data based on individuals with the name "David".```{python}#| fig-cap: Filtering Rows Based on One Condition#| label: fig-fig5test_dict = {"Pandas": lambda: (df_pandas.query("name=='David'")),"Polars": lambda: (df_polars.lazy().filter((pl.col("name") =="David")).collect()),}filter_row_one_condition = run_test(test_dict, "Filter (simple)")```## Based on multiple conditionsNow, for a more intricate challenge, we're going to dive into querying the data to extract individuals who meet specific criteria: those named David, born after 1980, residing in a city other than London, married, and with three children.```{python}#| fig-cap: Filtering Rows Based on Multiple Condition#| label: fig-fig6test_dict = {"Pandas": lambda: ( df_pandas.query("name=='David' and born>1980 and city != 'London' or is_married == True and children >= 3" ) ),"Polars": lambda: ( df_polars.lazy() .filter( (pl.col("name") =="David")& (pl.col("born") >1980)& (pl.col("city") !="London")| (pl.col("is_married") ==True) & (pl.col("children") >=3) ) .collect() ),}filter_row_multiple_condition = run_test(test_dict, "Filter (complex)")```::: {.callout-note}Both libraries tackled this challenge quite well, yet Pandas struggled to keep pace with Polars. It's intriguing to observe that while Pandas required nearly twice the time for the more intricate task, Polars managed to complete it in almost the same amount of time. Parallelization in action.:::# Performing operations on columnsNow, let's roll up our sleeves and dive into performing some operations on the columns.## Single operationAs a single operation, we'll simply calculate the century in which these individuals were born.```{python}#| fig-cap: Performing a Singme Operation on a Column#| label: fig-fig7test_dict = {"Pandas": lambda: (df_pandas.assign(born=lambda df: df.born.div(100).round())),"Polars": lambda: ( df_polars.lazy().with_columns((pl.col("born") /100).round()).collect() ),}operate_one_column = run_test(test_dict, "Operate (one column)")```## Multiple operationsLet's also explore what happens when performing multiple operations on the columns. We'll mix things up with some string operations, mapping, and math calculations to see how these libraries handle it!```{python}#| fig-cap: Perfrming a Multiple Operation on Columns#| label: fig-fig8test_dict = {"Pandas": lambda: ( df_pandas.assign( born=lambda df: df.born.div(100).round(), name=lambda df: df.name.str.lower(), city=lambda df: df.city.str.upper(), zip_code=lambda df: df.zip_code.mul(2), income=lambda df: df.income.div(10).astype("str"), is_married=lambda df: df.is_married.map({False: 0, True: 1}), children=lambda df: df.children.astype("bool"), car=lambda df: df.car.str[0], ) ),"Polars": lambda: ( df_polars.lazy() .with_columns( [ (pl.col("born") /100).round(), pl.col("name").str.to_lowercase(), pl.col("city").str.to_uppercase(), pl.col("zip_code") *2, (pl.col("income") /10).cast(pl.Utf8), pl.col("is_married").map_dict({False: 0, True: 1}), pl.col("children").cast(pl.Boolean), pl.col("car").str.slice(0, length=1), ] ) .collect() ),}operate_multiple_column = run_test(test_dict, "Operate (more columns)")```::: {.callout-note}Once again, Polars takes the lead. While both libraries required more time for the task involving multiple operations, Polars demonstrated superior scalability in this scenario.:::# Concatenating DataNow, let's turn our attention to concatenating two datasets. Let the merging begin!```{python}df_pandas2 = df_pandas.copy(deep=True)df_polars2 = df_polars.clone()``````{python}#| fig-cap: Concatenating Dataframes#| label: fig-fig9test_dict = {"Pandas": lambda: pd.concat([df_pandas, df_pandas2], axis=0),"Polars": lambda: pl.concat([df_polars, df_polars2], how="vertical"),}concatenate = run_test(test_dict, "Concatenate")```::: {.callout-note}Once more, Polars shines with a remarkable speed advantage.:::# Aggregation## Simple Time to shift our attention to aggregation. First up, a simple task: let's calculate the mean income based on names. Then, for a bit more complexity, we'll dive into computing statistics involving the income, children, and car columns. Things are about to get interesting!```{python}#| fig-cap: Performing a simple aggregation#| label: fig-fig10test_dict = {"Pandas": lambda: (df_pandas.groupby("name").income.mean()),"Polars": lambda: (df_polars.lazy().groupby("name").mean().collect()),}aggregate_simple = run_test(test_dict, "Aggregate (simple)")```## More complex```{python}#| fig-cap: Performing a complex aggregation#| label: fig-fig11test_dict = {"Pandas": lambda: ( df_pandas.groupby(["name", "car", "is_married"]).agg( born_min=("born", min), born_max=("born", max), income_mean=("income", np.mean), income_median=("income", np.median), children_mean=("children", np.mean), car_count=("car", "count"), ) ),"Polars": lambda: ( df_polars.lazy() .groupby(["name", "car", "is_married"]) .agg( [ pl.col("born").min().alias("born_min"), pl.col("born").max().alias("born_max"), pl.col("income").mean().alias("income_mean"), pl.col("income").median().alias("income_median"), pl.col("children").mean().alias("children_mean"), pl.col("car").count().alias("car_count"), ] ) .collect() ),}aggregate_complex = run_test(test_dict, "Aggregate (complex)")```::: {.callout-note}While Pandas showcased noteworthy speed for the simple aggregation, the more intricate task exposed significant disparities between the two libraries. Polars took a commanding lead in this scenario, presenting a considerably faster performance compared to Pandas.:::# Whole workflowWe're going to tackle this in two versions. First, we'll use `pl.read_csv` for Polars, assessing the time it takes to complete the entire workflow on a DataFrame already in memory. Then, for the second version, we'll employ `pl.scan_csv` for Polars. This nifty function lets us lazily read data from a CSV file or multiple files using glob patterns.The cool part? With `pl.scan_csv`, the query optimizer can push down predicates and projections to the scan level. This nifty move has the potential to cut down on memory overhead. Let's dive into both versions and see how they stack up!## Using `pl.read_csv````{python}#| fig-cap: Performing a representative workflow#| label: fig-fig12test_dict = {"Pandas": lambda: ( df_pandas.loc[:, lambda df: ~df.columns.isin(["is_married"])] .query("name=='David' and born>1980 and city != 'London' and children >= 3") .assign( born=lambda df: df.born.div(100).round(), name=lambda df: df.name.str.lower(), city=lambda df: df.city.str.upper(), zip_code=lambda df: df.zip_code.mul(2), income=lambda df: df.income.div(10), children=lambda df: df.children.astype("bool"), car=lambda df: df.car.str[0], ) .groupby( ["name","car", ] ) .agg( born_min=("born", min), born_max=("born", max), income_mean=("income", np.mean), income_median=("income", np.median), children_mean=("children", np.mean), car_count=("car", "count"), ) ),"Polars": lambda: ( df_polars.lazy() .select(cs.all() - cs.ends_with("married")) .filter( (pl.col("name") =="David")& (pl.col("born") >1980)& (pl.col("city") !="London")& (pl.col("children") >=3) ) .with_columns( [ (pl.col("born") /100).round(), pl.col("name").str.to_lowercase(), pl.col("city").str.to_uppercase(), pl.col("zip_code") *2, (pl.col("income") /10), pl.col("children").cast(pl.Boolean), pl.col("car").str.slice(0, length=1), ] ) .groupby( ["name","car", ] ) .agg( [ pl.col("born").min().alias("born_min"), pl.col("born").max().alias("born_max"), pl.col("income").mean().alias("income_mean"), pl.col("income").median().alias("income_median"), pl.col("children").mean().alias("children_mean"), pl.col("car").count().alias("car_count"), ] ) .collect() ),}whole_workflow_read_csv = run_test(test_dict, "Whole workflow")```## Using `pl.scan_csv````{python}#| fig-cap: Performing a representative workflow (using `pl_scan_csv` for Polars)#| label: fig-fig13test_dict = {"Pandas": lambda: ( df_pandas.loc[:, lambda df: ~df.columns.isin(["is_married"])] .query("name=='David' and born>1980 and city != 'London' and children >= 3") .assign( born=lambda df: df.born.div(100).round(), name=lambda df: df.name.str.lower(), city=lambda df: df.city.str.upper(), zip_code=lambda df: df.zip_code.mul(2), income=lambda df: df.income.div(10), children=lambda df: df.children.astype("bool"), car=lambda df: df.car.str[0], ) .groupby( ["name","car", ] ) .agg( born_min=("born", min), born_max=("born", max), income_mean=("income", np.mean), income_median=("income", np.median), children_mean=("children", np.mean), car_count=("car", "count"), ) ),"Polars": lambda: ( pl.scan_csv("sample.csv") .select(cs.all() - cs.ends_with("married")) .filter( (pl.col("name") =="David")& (pl.col("born") >1980)& (pl.col("city") !="London")& (pl.col("children") >=3) ) .with_columns( [ (pl.col("born") /100).round(), pl.col("name").str.to_lowercase(), pl.col("city").str.to_uppercase(), pl.col("zip_code") *2, (pl.col("income") /10), pl.col("children").cast(pl.Boolean), pl.col("car").str.slice(0, length=1), ] ) .groupby( ["name","car", ] ) .agg( [ pl.col("born").min().alias("born_min"), pl.col("born").max().alias("born_max"), pl.col("income").mean().alias("income_mean"), pl.col("income").median().alias("income_median"), pl.col("children").mean().alias("children_mean"), pl.col("car").count().alias("car_count"), ] ) .collect() ),}whole_workflow_scan_csv = run_test(test_dict, "Whole workflow (scan_csv)")```As evident, the utilization of `scan_csv` increased the required time by about 3-4 times. However, even with this increase, Polars still manages to maintain a substantial advantage of around 5 times faster than the entire workflow executed using Pandas.::: {.callout-note}When we consider the entirety of the data processing pipeline, irrespective of the file reading approach, Polars emerges as the victor. It consistently exhibits a considerable speed advantage compared to Pandas.:::# Putting it all togetherTime to bring together all the things we've explored! Let's sum up what we've learned.```{python}#| fig-cap: Analyzing Speed Differences Between Pandas and Polars in Common Data Science Workflow Operations#| label: fig-fig14#| column: pagesummary = pd.concat( [ read_csv, read_parquet,# select_col_not_in_memory, select_col_in_memory, filter_row_one_condition, filter_row_multiple_condition, operate_one_column, operate_multiple_column, concatenate, aggregate_simple, aggregate_complex, whole_workflow_read_csv,# whole_workflow_scan_csv ], axis=1,)fig = ( summary.melt(var_name="experiment", value_name="time_sec") .assign( package=lambda df: df.experiment.str[-6:], experiment=lambda df: df.experiment.str[:-7], ) .groupby(["experiment", "package"]) .time_sec.mean() .mul(1000) .reset_index() .pivot(columns="package", values="time_sec", index="experiment") .sort_values( by="polars", ) .reset_index() .pipe(lambda df: px.bar( df, y="experiment", x=["pandas", "polars"], barmode="group", text_auto='.2s', orientation="h", labels={"value": "Average execution Time (msec)", "experiment": ""}, title=f'<b>Speed Showdown in Data Workflow Operations</b> <br><sup><i>Pandas {pd.__version__} vs. Polars {pl.__version__}</i></sup>', width=1000, height=700, ) ))fig.update_xaxes(tickangle=90)fig.update_layout(yaxis=dict(automargin=True), legend_title_text="Library")fig.update_traces(textposition='outside')```Throughout our exploration, Polars has consistently outperformed Pandas. We intentionally focused on tasks that frequently arise in data analysis, encompassing file reading, column selection, filtering, and more. However, where Polars truly shines is in aggregation, capitalizing on its multiprocessing prowess. Take a glance at @fig-fig14, and you'll notice that reading files is the most time-consuming step for Polars. To address this (and reduce memory usage at the same time), we have the nifty `scan_csv` function that enables Polars to operate in a lazy mode, optimizing the entire data pipeline. It's clear that Polars packs a punch in the realm of data processing!Should you consider using Polars? Based on the findings of this experiment, it's evident that Polars can bring substantial benefits to every stage of your data processing. Its remarkable speed can make a significant impact. If you're open to investing time in understanding its API (which isn't drastically different from Pandas), I'm confident your projects will gain a significant speed boost. And don't forget, the [trajectory](https://www.crunchbase.com/funding_round/polars-seed--963c89d2) of the project seems promising, implying a bright future ahead.Whatever path you choose, ensure you become well-acquainted with your selected library and stay current with emerging technologies. And above all, relish the journey. Happy coding, and until next time! 🐼🐍🤓💻